arXiv:1503.06610vl [cs.DS] 23 Mar 2015 


Efficient Generation of Stable Planar Cages for 

Chemistry 


Dominique Barth, Olivier David, Franck Quessette, Vincent Reinhard, Yann 

Strozecki, Sandrine Vial * 

Universite de Versailles Saint-Quentin 


Abstract. In this paper we describe an algorithm which generates all 
colored planar maps with a good minimum sparsity from simple motifs 
and rules to connect them. An implementation of this algorithm is avail¬ 
able and is used by chemists who want to quickly generate all sound 
molecules they can obtain by mixing some basic components. 


1 Introduction 

Carbon dioxide, as well as methane can be absorbed by large organic cages [1]. 
These cages are formed by spontaneous assembly of small organic molecules, 
called motifs, bearing different reacting centres. The prediction of the overall 
shape of the cage that will be obtained by mixing the starting motifs is rather 
difficult, especially because a given set of reacting partners can lead to very 
different cages. It is hence crucial for chemists to have an operating tool that is 
capable of generating the many shapes of cages accessible from predetermined 
molecular motifs. 

In this paper we present the algorithms we have designed and implemented to 
generates molecules that are much larger and less regular that what the chemists 
usually design by hand. The molecules are modelled by maps i.e. planar embed¬ 
dings of planar graphs, as explained in Sec. The use of maps may seem un¬ 
suitable since they do not represent spatial positions. Though, planar maps are 
a good model for spherical topologies and the embedding capture the rigidity of 
the motifs. We must also be able to select the most relevant molecules among the 
huge number we generate. In Sec. |3.4[ we characterize what a “good” molecule is 
through graph parameters which are then used to filter the best molecules. The 
relevance of our modeling and of our parameters is validated by the results we 
obtain: All small molecules (5-10 motifs) we generate and consider to be good 
according to our parameters have been studied before by chemists. Some of the 
very regular molecules of medium size (10-20 motifs) we generate correspond to 
the largest cages chemists have ever produced. We also have produced cages of 
shape unknown to chemists that they now try to synthesize (see Sec. |^. 

The aim of this paper is the generation of all colored planar maps up to 
isomorphism representing possible molecules obtained from a set of elementary 
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starting motifs (colors). As with all enumeration problems, one difficulty is to 
avoid to produce a solution several times. Moreover the number of solutions may 
grow exponentially with their size, it is here the case for all bases of motifs but 
the most contrived. The complexity of such enumeration problems must then 
take into account the number of produced solutions (see [2] for more details on 
enumeration). 

We say that an algorithm is in polynomial total time if its complexity is 
polynomial in the number of solutions and polynomial in the size of the produced 
solutions. In our context, where the number of solutions is always exponential in 
their size, we are interested in linear total time algorithms. The best algorithms 
are in constant amortized time (CAT): the algorithm uses on average a constant 
time to generate each solution. This kind of efficient algorithms exists for simple 
enumeration problems such as listing all trees [5]. We may also want to bound 
the delay that is the time between the production of two consecutive solutions. 
Good algorithms have a delay polynomial, linear or even constant in the size of 
the generated solutions. 

There exist numerous works on enumeration and generation of planar maps [3] , 
but none of them deals with the generation of planar maps built with a set of 
starting motifs and color constraints. Moreover, most of the literature deals with 
non-constructive tools [5] or yields algorithms which are not in polynomial total 
time. There are a few programs such as plantri and CaGe [7] which generate 
efficiently some particular class of planar graphs such as cubic graphs or graphs 
with bounded size of face but they are not general enough for our purposes. 

The algorithm we present in Sec. |^is far from being in polynomial total time 
since we are not able to bound the number of isomorphic copies of each solution 
we generate. However, we will present several subroutines used in our algorithm 
which are either CAT, for instance the generation of paths and almost foldable 
paths in Sec. 3.1 or in linear delay such as the folding of unsaturated maps of 
motifs in Sec. 3.2 Moreover, we study several heuristics and improvements which 
makes the enumeration feasible for maps of medium size. Sec. [^presents numer¬ 
ical results which supports this assertion and illustrates the relative interest of 
our heuristics. 


2 Modeling of the problem 


In this section, we propose the modeling of our problem by maps. A map is a 
connected planar graph drawn on the sphere considered up to continuous defor¬ 
mation. Note that by Steinitz’s theorem, when a planar graph is 3-connected, 
there is only one corresponding map, but otherwise there may be several of 
them. It is relevant to distinguish between two maps with the same underlying 
graph, since the geometrical informations contained in the maps are useful to 
the chemist who are interested in their 3D representation. All maps used in this 
paper are vertex-colored maps. The representation of a map is a graph and a 
cyclic order of the neighbors around each vertex. 
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We first model the basic chemical elements with maps we call motifs. Then 
the motifs are assembled to form a map of motifs and from this map we derive 
a molecular map that is a more faithful model of the molecular cages we try to 
design. 

We use a finite even set of colors A — {a, a, b,b,c,c, ■ ■ ■} where each positive 
color a in A has a unique complementary negative color denoted by a and a is 
the complementary color of a. Each color represents a different kind of reacting 
center. Let us give the dehnition of motifs. 

Definition 1. A map G = (14 LI V, E, next) is a motif if (1) 14 contains only 
one vertex c called the center, (2) each vertex in V is colored with a color 
in A, (3) E = {(c, m), u G V}, and (4) next gives an order on the edges 
of c: next{{c,u)) = {c,v) means that the edge {c,v) is ’’following” the edge 
{c,u) in a clockwise drawing of G. For all k < \V\, nexif{{c,u)) {c,u) and 
((c, u)) = {c,u). 

Note that a motif is a star graph. We assume as input A4 a finite set of motifs 
all different. Each motif is identified by a distinct color from an alphabet Am 
disjoint from A induced by the colors existing in A4. Fig. gives examples of 
motifs. 



Fig. 1 . Example of motifs on Am = {Y, I, X, V) J} and A = {a, a, b, b}. 

Definition 2. A connected planar map G = (14 UI 4 E, next) is a map of motifs 
based on AA if, (1) the closed neighborhood of each vertex in 14 is a motif, 
(2) each vertex in V is connected to exactly one vertex in 14 and at most one 
vertex in V. If u and v in V are connected, the colors of u and v must be 
complementary. The number of vertices in 14 is called the size ofG. 

Note that each motif of AA may appear any number of times in a map of 
motifs, it may also be not present. A motif is a map of motifs of size 1. In a map 
of motifs, a vertex of degree 1 in E is called a free vertex. A map of motifs with 
no free vertex is called saturated otherwise it is called unsaturated. 

In our implementation, we have an ordering of the edges around each element 
of 14 consistent with next has been fixed. For optimal performances, we use in 
our implementation a rotation map to represent a map of motif. For each vertex 
Cl G 14, it maps the edge of ci, which connects ci to u, to a triplet (a,C2,j) 
where a is the color of u, (ci,u, v, C 2 ) is a path with C2 S 14 and the edge {c 2 ,v) 
is the of C2. The color of v is necessarly a and is thus not represented. When 
u is not connected to another vertex, C 2 and j are set to a default value. 

Based on a saturated map of motifs we construct the molecular map that is 
the graph model of the cages. 
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Fig. 2. Example of two maps of motifs based oa A4 — {Y, I}, the first map is unsatu¬ 
rated while the second map is saturated. 


Definition 3. Let G = (14 U V,Eg, nexta) he a saturated map of motifs based 
on j\4, we define the molecular map M as the map G where all paths of size 
three between vertices of 14 are replaced by an edge. 



Fig. 3. The molecular map corresponding to the saturated map of motifs in Fig. 


3 Description of the algorithm 


The aim of this paper is to solve the following problem: given a base of motifs 
At and an integer n, enumerate all molecular maps of size n based on At. The 
complexity depends only on n since the size of At and the size of its elements are 
assumed to be small constants (usually less than 4). In this section, we describe 
an algorithm which solves this problem and explain in details its two main steps. 

The first one, the concatenation, consists in adding edges between comple¬ 
mentary vertices of two maps of motifs in such a way the result is still a map of 
motifs. In this paper, we always concatenate a single motif to a map of motifs. 


see [5] for other concatenations. Sec. 3.1 presents the different strategies of con¬ 
catenation. The second, the fold or folding, consists in adding an edge between 
two complementary vertices of a map of motifs, in such a way the result is a 
map of motifs. Sec. |3. 2 [ presents an efficient approach to folding that we use to 
saturate the maps obtained by concatenation. Then, Sec. |3.3| explain how we 
detect and discard isomorphic copies of the same graph. Finally in Sec. |3.4[ we 
introdnce the indices which characterize a good molecular map and explain how 
we compute them. 
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3.1 Backbone generation 

The first step is to generate all backbones, that is unsaturated maps of motifs of 
a given size n which are of a very simple shape. The aim is that, by folding these 
backbones in a second step, we will recover all saturated maps of motifs. Since 
every map of motifs have a spanning tree, we can choose trees as backbones and 
be sure to recover all saturated maps. But for performance reason, we will also 
use paths and cycles as backbones. This turns out to be good heuristics, speeding 
up considerably our algorithm while only mildly reducing the set of generated 
maps of motifs. We would also like to restrict the backbones to those which can 
be folded into some saturated map. We address this problem by enumerating 
only what we call the almost foldable backbones, with a complexity as good as 
for the generation of regular backbones. This new algorithm greatly improve the 
computation time. 

Spanning tree. In a first version of our algorithm [S], the set of non isomorphic 
trees of size n was explicitly stored. To produce the set of trees of size n + 1, a 
single motif of every possible color was concatenated to each free vertex of each 
tree of size n. This generates all trees of size n + 1, but the drawback is that 
some trees are generated several times. The algorithm was thus not in linear total 
time and we needed to do an isomorphism test on every generated tree. We now 
generate all trees where the root and its first edge are fixed with a simple CAT 
algorithm. This method generates a tree as many times as edges in the tree: one 
for each choice of a vertex as root and for each choice of first edge of this root. 
Therefore, the implemented algorithm do not need to store the trees which are 
produced on the fly, and has a linear delay. A way to further improve this would 
be to use ideas from CAT algorithms which generate unrooted trees [3]. The 
main idea is to choose as root the centroid of the tree. However we have to deal 
with a second and harder problem: we generate maps of motifs and their vertices 
are colored. We can generate all maps of motifs sharing the same underlying tree 
efficiently but they may turn out to be isomorphic. 

Hamiltonian paths. Since generating trees is not easy, we propose to use simpler 
objects as backbones, here maps of motifs such that all vertices of 14 are on a 
path. These maps are caterpillar trees, but since the elements of 14 on the central 
path entirely determine the elements at distance one, we will consider them as 
paths and call them so. There are two advantages to generating paths instead 
of trees: they are easier to generate and their number is smaller. The drawback 
is that not any planar graph has an Hamiltonian path, therefore we could miss 
some planar maps in our enumeration. However, most small planar graphs have 
an Hamiltonian path, for instance all planar cubic 3-connected graphs of size 
less than 38 [9] and, if Barnette’s conjecture holds, all fullerene graphs. 

The regularity of the graphs (all vertices of the same degree) crucially matters 
in the existence of an Hamiltonian path. Consider for instance the base of motifs 
M. = {I, Y} from Fig. All molecular maps based on M. are bipartite graphs: 
the I’s in one set of the bipartition and the Y’s in the other. But in saturated 
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maps of motifs, we have twice the number of Y equal three times the number of 
I because all vertices in V must be connected, therefore there are no Hamiltonian 
path except in graphs with exactly three I and two Y. This problem can be easily 
solved by building from A4 a new base of motifs which in the end generates the 
same molecular maps (see Sec. |^. 

Let us now explain how we generate all paths based on a set of motifs A4. 
We first build for each letter a € A a list La of all non isomorphic motifs whose 
first edge is incident to a vertex of label a. This data structure allows us to have 
a complexity independent of the size of A4 and of A. Then to build all possible 
paths of size n + 1 from a path of size n, we consider its last vertex c G 14 
and for each of the free vertex v connected to c and of color a, we attach every 
motif of La- Remark that beginning by the empty path, we generate all possible 
paths of a given size by applying recursively the algorithm. If we consider the 
paths as rooted at the first vertex produced during the algorithm, every path 
generated is clearly different. However, we can also consider the last concatenated 
vertex as the beginning of the path, which means we generate every path but 
the palindromes twice. To avoid that, we put an ordering on Am, the colors 
of the center vertices, and we consider the sequence of colors in a path. If the 
sequence of colors from the beginning to the end is lexicographically larger than 
the sequence from the end to beginning we output the path otherwise we do not. 
This is implemented in our algorithm and adds only in average a constant time. 

Proposition 1. The previous algorithm produces all maps of motifs which are 
paths without redundancies in constant amortized time, when in the base of motifs 
no two motifs of degree 2 can be concatenated. 

Proof. The tree of recursive calls of our algorithm can always be seen as of degree 
at least 3 by merging nodes of degree 2 to nodes of degree larger. Therefore it has 
at least as many internal nodes as leaves which correspond to output solutions. 
Since the algorithm needs only a constant time to go from one node to another, 
the generation of all paths can be done in constant amortized time. □ 

In our practical examples, there are never motifs of degree two which can be 
concatenated. Without this condition, the algorithm has still a linear delay. 


Hamiltonian cycles. If we want to further restrict the backbones we generate, 
a simple idea is to consider cycles instead of paths. Again it is a good choice if 
all motifs have the same degree or can be made so, since for instance all planar 
cubic 3-connected graphs of size less than 23 have an Hamiltonian cycle nni. 
Moreover, we will only generate 2-connected graphs and not the ones which are 
only I-connected. It is a desirable side effect, since those graphs have a bridge 
they are always the worse for the two main indices we are interested with, i.e. 
the minimum sparsity and the size of the largest cycle (see Sec. 3.4). 

In our implementation, we obtain the cycles by generating every path and 
by connecting their beginning to their end when possible. The same cycle can 
be obtained from several different paths (at most as much as its number of 
vertices). Therefore our algorithm is in linear amortized time. The question is, 
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can we generate all cycles with a CAT algorithm? It seems hard because we 
cannot fix a natural first vertex in a cycle as in a tree, since all its vertices can 
be isomorphic. 

Almost foldable paths. In each backbone we build, all free vertices will eventually 
be folded to get a saturated map of motifs. A simple necessary condition on the 
colors of a saturated map of motifs is that for each color a G A, there are as 
many vertices in V labeled by a and a. A backbone which satisfies this condition 
is said to be almost foldable. Let G be a map of motifs and let Oi,..., be the 
positive colors of the alphabet A. We denote by Cg the characteristic vector of 
G, it is of size k and its component is the number of elements in V labeled 
by Qi minus the number of elements labeled by di. Note that a map G is almost 
foldable if and only if Cq is the zero vector. 

We propose here a method to generate in constant amortized time only the 
almost foldable paths. We introduce a function F : N x A x —>■ 2-^ which has 
the following semantic: a' G F{n, a, (ci,..., c^)) if and only if (I) there is a path 
P of size n with a free vertex in the first motif labeled by a, (2) Cp = (ci,..., Ck), 
(3) a vertex of the last motif is labeled by o'. 

Proposition 2. There is an algorithm which enumerates all almost foldable 
paths in constant amortized time plus a precomputation in 0(n^+^), when in 
the base of motifs no two motifs of degree 2 can be concatenated. 

Proof. First, we explain how to generate all needed values of the function F 
in time 0(n*"'"^) by dynamic programming. Denote by / the maximal number 
of vertices in a motif labeled by the same color. For a path P of size n, it is 
clear that the coefficients in Cp are all in the interval [—nf,nf]. Therefore, to 
generate paths of size n, since / and the size of A are constants, we need to store 
O values of F only. 

F is easy to compute for n = I: we consider each motif M G M. and each v 
of label a in M, and let F(l, a, Cm) be the set of labels of all vertices of M but 
V. Assume we have generated the values of F for n, we generate the values for 
n + 1 in the following way. For each a, G and each a' G Fiji, a, G), we consider 
all motifs M G JA such that one of their vertex is labeled by d. We add all the 
labels of the other vertices to the set F{n, a,C + Cm)- This algorithm only does 
a constant number of operations for each value of F it computes, therefore its 
complexity is 0(n^+^). 

Now that F is computed, we use it in our path generation algorithm to 
generate only the almost foldable paths. Assume we have generated a path P 
of size n', its characteristic vector Cp and we want to add a node at the end 
by connecting it to a node of label a. Assume we have already computed Cp. 
The algorithm checks if F{n — n',d, —Cp) ^ 0. If it is the case the algorithm 
go on normally otherwise it backtracks since this extension cannot yield a non 
foldable path. This improvement only adds a single test at each step of the 
original algorithm, plus an addition of a constant sized vector to maintain the 
value of Cp. Therefore it is in constant amortized time. □ 
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The complexity of the precomputation may seem to be large but k must 
be seen as a small constant (less than 4). It is negligible with respect to the 
generation of paths, which is exponential in n because of the number of non 
isomorphic paths. In practice, the precomputation takes only a few milliseconds 
for size of graphs up to 40 on a regular desktop computer. On the other hand, 
this optimization makes the time to computes all the backbones much smaller 
than the time to do the next steps. 

Almost foldable trees. Following the idea used to efficiently compute almost 
foldable paths, we give here two ways to generate the almost foldable trees. 

When we extend a tree by a concatenation, it can be through any vertex. 
To keep the same dynamic programming algorithm as for paths we should track 
all free vertices in the tree in construction, which would make the algorithm 
exponential time. There are two solutions to this problem, the first and the 
one we have implemented is to compute a multidimensional array A such that 
A{n, C) = 1 if there is a forest F of size n such that Cp = C and A{n, C) = 0 
otherwise. We can thus test in our algorithm generating trees, whether any 
partial tree can be extended to a structure of the right size by a forest. Since 
we generate trees and not forests, we will sometimes expand a partial tree and 
obtain no almost foldable backbone in the end. 

The second solution is to change the characteristic vector of a backbone so 
that each of its component is the number of free vertices of some color positive 
or negative. In this way it is easy to compute an array A such that A{n, (7) = I if 
there is a tree T of size n such that Cp = C and A{n, (7) = 0 otherwise. Indeed, 
for each motif M with a free vertex of color a, if for some C A{n, (7) = 1 and C 
has a non-zero component d then there is a tree of size n+1 with vector C + Cm 
that is A{n, C -I- Cm) = 1- The only drawback is that the size of A and thus the 
complexity of the precomputation is where k is the number of positive 

colors while the size of A in the solution we have implemented is 

3.2 Folding of the backbones 

Let G be a map of motifs, the fold operation on the vertices u and v is adding 
the edge (u, v) to G. The operation is valid if u and v are free, of complementary 
colors and in the same face of G. Therefore, the graph obtained after the fold is 
still a map of motifs. In this section we generate from a backbone, by sequences 
of folds, all possible saturated maps of motifs. 

The outline of a face is the list in order of traversal of the free vertices. 
An outline is a circular sequence of vertices [vi,... ,Vn) € F". Sequence means 
that the order is significant and circular means that the starting point is not. 
For instance, {vi,V 2 ,V 3 ) and (v 3 ,vi,V 2 ) are the same circular sequence but are 
different from {v 3 ,V 2 ,Vi). Remark that a tree or a path has a single outline, 
a cycle has two and a saturated map has only empty outlines. The color of 
an outline (vi,... ,Vn) is the word wi... Wn with Wi the color of Vi. Folding two 
vertices Vi and Vj in the same outline of color WiWiW 2 WjW 3 creates two outlines 
of color IF 3 IF 1 and W 2 . The fold operation can then be seen as an operation from 



words over A to multiset of words. Remark that this operation is very similar to 
the reduction of consecutive complementary parentheses which enables to define 
the classical Dyck language of balanced string parentheses. 



Fig. 4. A map on Am = {VI, V2, J} and its outline before and after a fold operation. 


Applying a sequence of fold to a backbone to get a saturated map is the same 
as applying a sequence of reductions to the colors of an outline so that we obtain 
only empty words. We work from now on only on the words Wi ... w„ and on 
sequences of reductions. If in a sequence of reductions, the reduction is applied 
to iVi and wj we say that the sequence pairs i with j. 

Let us call a word (or a multiset of words) which reduces to a multiset of 
empty words a foldable word. As in the case of parentheses languages, we can 
restrict the reduction to consecutive complementary letters which transforms 
WiadW 2 into the word IV 1 IL 2 . Indeed, when a word is foldable, it can be reduced 
to empty words using the restricted reduction of consecutive letters only by 
reordering the sequence of reductions. We call result of a sequence of reductions 
the set of pairs (i,j) such that the sequence has paired i and j. The previous 
remark shows that it is indeed a set of pairs and not a sequence. Our aim is to 
generate all different results of sequences of reductions on foldable words without 
redundancies. 

Lemma 1 (Folklore). The restricted reduction on words is confluent i.e. each 
sequence of restricted reduction starting from a foldable word can be extended so 
that we get an empty word. 

Proof. To prove our lemma, it is enough to prove that if S is the sequence 
which reduces a word W = WiWiWi+iW 2 with Wi = wf+i, then W 1 W 2 is fold- 
able. If S pairs i and i + 1, then W 1 W 2 can be reduced to the empty word 
by S. We now assume that S pairs Wi with Wk and Wi+i with w;, where 
W = WfiVkWfwiWi+iWfwiWf. Remark that the case where S pairs Wi with 
wi and Wi+i with Wk is not possible because all letters between i and I must 
be paired together by definition and i -I- 1 is between i and I but not k. Inside 
the sequence S, we can find subsequences which reduce Wf, Wf and ivI^IT| 
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to empty words since we are allowed to reduce consecutive letters only. There¬ 
fore W 1 W 2 = WlwkWlW^wiW^ can be reduced to the empty word. First the 
sequences reducing WlWl are used to obtain the word WlwkWiW^- Then one 
step of reduction remove WkWi which are of complementary color by definition. 
Finally we obtain which is foldable. □ 

As a consequence of this lemma, we get a simple algorithm for testing whether 
a word is foldable: reduce the word as long as it is possible and if an empty word 
is obtained, the word is foldable. 

Proposition 3. There is a linear time algorithm to test whether a word is fold- 
able. 

Proof. The word is represented by a doubly linked list of its letters. At a given 
step of the algorithm we are at some position i in the list. If the letters at position 
i and j-|-l in the list are complementary, they are removed and i is set to be i —1 if 
possible, 0 otherwise. If the letters are not complementary, i is incremented. The 
algorithm stops and decides that the word is foldable when the list is empty. If i 
is at some point the last element of the list then the algorithm stops and decides 
that the word is not foldable. The algorithm is clearly in linear time, since at each 
step either the size of the list decreases or the current position increases. Finally 
this algorithm is correct, because if it stops without removing every element in 
the list, it means that there are no two consecutive complementary letters left. 
Therefore there are no possible further restricted reductions and the obtained 
word is not foldable. By Lemmasince the reduction is confluent, the original 
word is also not foldable. □ 

We use this algorithm each time we produce a backbone to test whether it 
can be folded into a saturated map of motifs. Note that, even if we generate 
almost foldable backbones only, we may generate some which are not foldable 
such as those with outline baba. 

Proposition 4. There is an algorithm which enumerates all distinct results of 
sequences of reduction on a foldable word, with a linear delay and a quadratic 
precomputation. 

Proof. For a given word W we first build the lists Li which contain the set of 
indices j > i such that Wi can be folded with Wj and the obtained set of words 
is still foldable. 

The lists Li are built from a boolean matrix M such that Mij is true if 
and only if the word Wi... Wj is foldable. The matrix is computed by dynamic 
programming: Mi^i+i is true if and only if Wi and Wi+i are complementary. We 
compute Mij once we have computed all Mi^ji such that (/ — i') < {j — i) by 
using the fact that Wi... Wj is foldable if and only iiwi.. .Wk and Wk+i.. .Wj are 
foldable for some k in [i-\-\,j] or Wi and Wj are complementary and Wi+i... wj-i 
is foldable. By this method, the matrix M is computed in time cubic in the size of 
the word. In fact, by Lemma[^ if there is a fc such that Wi.. .Wk and Wk+i.. .Wj 
are foldable, then for all I such that Wi.. .wi is foldable, then wz-i-i .. .Wj is 
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foldable. We store for each i the smallest k > i such that Wi.. .Wk is foldable. 
Hence we can decide whether there is a fc such that Wi.. .Wk is foldable in 
constant time and we compute the matrix M in quadratic time. 

Remark that a sequence of reductions applied to a word W yields a set of 
subwords which are consecutive letters of W. Therefore we can represent the 
result of several reductions by a set of pairs {(/i, ri),..., rfe)} with {li,ri) 
representing the word w;. ...Wn and li < Vi < k+i. We build the results of 
sequences of reductions in a recursive way. Assume we have already built a 
result R through a sequence of reductions applied to W, which has produced 
the set {{li,ri ),..., {lk,rk)}- We consider li, the index of the first letter which 
has not been reduced and we do the reduction with every possible letter of index 
i G [Zi, ri] n Lq which produces the set {(Zi, i), {i + 1, ri)..., (Zfe, r^)} and the 
result i?U{(Zi,*)}. By using recursively this algorithm starting on W, we obtain 
all possible results R corresponding to a reduction to a multiset of empty words. 
It is not possible to generate twice a result since at any point of the algorithm 
we make recursive calls on i?U {(Zi,i)} for different values of i which makes the 
results produced by each call disjoint. Between two recursive calls we do only a 
constant number of operations, therefore the delay is bounded by the depth of 
the tree of recursive calls, that is the size of the word W. □ 

The enumeration algorithm we have described is exponentially better than 
the naive one where each possible letter is folded when it is next to a comple¬ 
mentary letter and so on recursively. The complexity of the naive algorithm is 
proportional to the number of sequences of reductions while our is proportional 
to the number of results. For instance, on words of the form W" with W = aaaa, 
there is only one result but ( 2 n)! sequences of reductions. 

3.3 Dealing with isomorphic copies 

Since the construction process does not guaranty uniqueness of the generated 
maps, we need to detect during the enumeration the isomorphic copies of already 
generated maps to discard them. To do that we need to compute a unique 
signature for each map and we must store all produced maps and their signatures. 
Since the number of maps grows exponentially with their size, they are stored 
in a dynamic set structure which supports logarithmic addition and research of 
elements. In our implementation we have used an AVL whose key is the signature. 
Hence each time a new map is produced, we compute its signature and if this 
signature is already in the AVL, it is simply not inserted. 

From a theoretical point of view, planar isomorphism is well understood 
since it has been proved to be solvable in almost linear time El and logarithmic 
space |12j . However this algorithm is not practical and hard to implement as 
observed in na, especially if we want a signature rather than just an isomor¬ 
phism test. This is particularly true for our small graphs of size about 20, which 
is the reason why we rely on a simpler algorithm of quadratic complexity in the 
spirit of m- The idea is that in a map, when a first edge is fixed we can do a 
deterministic traversal of the graph using the order on each neighborhood. The 
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signature is the least lexicographic traversal amongst the traversals beginning 
by all edges of the map. 

Let us describe precisely the quadratic isomorphism algorithm. All the sig¬ 
natures are numbers in a base B with B = n + \A\ + \Am\- The first step that 
is common to all the maps of motifs of the same size n is to assign to each color 
in A and Am a different digit in [n,n + \ A\ + |Am|) in base B. In a map of 
motifs G = ( 14 , A, next) of size n and for any edge {c,u) G E with c € 14 
we perform a deterministic depth first search that will define the signature of 
G starting at (c,u). Since signatures are numbers, they can be easily compared 
and the signature of G will be the minimum number over all starting points. 

For computing a signature starting at (c, u) , at first visit of each vertex in 14 
assign an index number that is a digit in the range [0, n) in the base B. From c 
visit its neighbor u: since the map is saturated u is connected to a vertex v € V 
and V is connected to a vertex c' G I 4 . Construct the signature by concatenating 
the index number of c, the digits of the colors c, u, v and c' and the index number 
of c'. If c' is already visited backtrack and continue the visit from (c, next{{c, u))) 
else continue the visit starting at {c',v') with {c',v') = next((c',u)) and so 
on until all quadruplets (c, u, v, c') are visited once. At the end, we obtain a 
signature in base B for the starting point (c, u). Note that the signature itself is 
of size linear in n. Given any signature one may exactly reconstruct the graph. 
Conversely two graphs which are isomorphic have the same signature because 
the signature computation does not take into account the order or name of the 
nodes. 

We make a simple optimization, which is crucial, since profiling our algorithm 
reveals that it spends more than half of its time computing signatures. We assign 
the lower digits to the colors of c and u such that the number of couples (c, u) 
is minimal and non zero. Since the signatures are constructed with the most 
significant bit first, during the construction of a signature, we test for each digit 
added if the signature is at this point greater than the minimal one. Thus we 
can cut very efficiently in the signature calculation process. 

Moreover, the computed signature allows to detect chiral molecules, a very 
important notion in chemistry. Two maps are chiral if one is isomorphic to the 
other when the order of the next predicate is reversed for all neighborhoods. 

3.4 Indices computed on the molecular map 

A molecular map is a candidate to be a “good” cage for chemistry. The definition 
of a “good” cage is merely topological: the 3D shape must be close to a sphere, it 
must be resistant to deformations and cuts and it must have an ’’entrance”. We 
are able to check if a molecule satisfies or not these requirements only by consid¬ 
ering the structure of its molecular map: First the map is planar and connected 
by construction. In quadratic time we compute the equivalence classes of vertices 
up to automorphism, using the same technique as to compute a signature, which 
helps measure the sphericity of the cage. The entrance is given by the size of its 
largest face, which is easily computed in linear time. The resistance of a map is 
given by its minimum sparsity. 
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From a large set of experiments, these indices have proved to be realistic 
to the chemist on several examples (see Sec. |^. They are then used in our 
implementation to limit the number of molecular maps output by the program, 
which would otherwise be in such great number that a chemist could not try to 
study them all. For instance, all maps with a small minimum sparsity are filtered 
out. 

Distribution of the sizes of faces The faces size is an important parameter in 
the cage construction. The chemist wants a cage with an ’’entrance”. In graph 
terms we seek for graphs with one large face and all the others faces of size 
around the mean size, which makes the molecule more spherical in practice. The 
distribution of the face size is straightforward to compute. As an indicator we 
compute the size difference between the two largest faces divided by the mean 
size. This indicator is zero when there is two largest faces with the same size 
and grows with the entrance size. 

Equivalence classes of the vertices Two vertices (motifs) of a molecular map are 
in the same class if it exists an automorphism that send one to the other. We 
compute the equivalence classes of all vertices: If the signature starting form 
(ci,ui) is equal to the signature starting at (c 2 ,U 2 ) the motif centered on ci is 
in the same class as he motif centered on C 2 . The chemist, when synthesizing a 
molecule corresponding to a molecular map, will use the same compound for all 
motifs in the same equivalence class. In addition the less the number of classes 
the more the molecule has a spherical shape. 

Minimum sparsity We now define the sparsity and explain how to compute 
it, since it is the most relevant index and the hardest to compute. A cut of a 
graph G = {V,E) is a bipartition of V. The size of a cut S = {Si,S 2 ) is the 
number of edges with one end in Si and the other in S 2 - The sparsity of a cut is 
sparsity{S) = |) ■ The Sparsest Cut problem is to find the minimum 

sparsity over all cuts. We first implemented a brute-force algorithm, using a 
Gray code which enumerates all possible partitions of the set of vertices in time 
0(2") where n is the number of vertices in our graph. Since we were using a 
Gray code, the partition changes at each step by only one element and the cut 
can be computed in constant time from the previous one. Therefore we have a 
simple algorithm with complexity 0(2") where n is the number of vertices in 
our graph, which is useful for n up to twenty but not practical for larger sizes. 

Although computing the minimum sparsity is NP-complete in general (min¬ 
imum cut into bounded set in |15jb there is a polynomial time algorithm when 
the graph is planar |16j . Since the time to compute the minimum sparsity was 
the limiting factor of our program, we have implemented and adapted to our 
case this more complicated algorithm (which has never been done as far as we 
know). 

The main idea is that a cut in a graph corresponds exactly to a cycle in the 
dual graph (see im for graph definitions useful in this paragraph). A weight 
is associated to each cycle of the dual: if the corresponding cut in the primal 
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partitions it into Si and S 2 , the weight is minds'll, IS 2 I). From a spanning tree 
of the dual, we build a base of its fundamental cycles. A fundamental cycle is 
given by any edge not in the spanning tree completed by edges of the spanning 
tree to form a minimal cycle. From symmetric differences of fundamental cycles, 
we can generate every cycle and its weight. 

For each edge in the dual, we build a graph such that paths from a given 
vertex correspond to cycles of the dual which use the edge. Moreover, the weight 
of the cycle can be read in the last vertex of the path, and the size of the 
corresponding cut is the length of the path. Therefore, computing a single source 
shortest-path in each of these graphs enables us to compute the value of the 
sparsest-cut. While in the original article this was done by a modified Dijkstra 
algorithm, we use a breadth first-search. This is faster and it enables us to use 
a good heuristic: at any point of one of the breadth first-search, we know the 
current distance from the source can only increase. We can stop the search, if 
this distance divided by the maximal weight (equal to the number of vertices) is 
larger than the current minimum sparsity value. This implementation has very 
good practical performances: on a regular desktop computer the mean time 
to compute the sparsest cut of a graph of size 30 is 0.2 ms while the brute force 
algorithm needs 6000 ms. 


4 Metamotifs 

From a base of motifs, we can generate a new one, by concatenation of elements 
of the base. The new motifs are called metamotifs. It is useful, if the new elements 
added to the base can be used to remove other elements of the base so that some 
good properties are enforced. 

For instance, one can remove the elements of degree 2 (if they cannot be 
concatenated together), while not increasing the degree of motifs in the base. 
Every motif of degree 2 is concatenated in every possible way to the other motifs 
and deleted. From our example {I, Y}, we obtain a base {Yo,Yi,Y 2 ,Y 3 } where 
the Yi are of degree 3 and have i vertices of V labeled by a and the others by 
a. If we now generate all molecular maps of size n based on {Yq, Yi, Y 2 , Y 3 } it 
is easy to convert them into molecular maps based on Ai. The converted maps 
are of size exactly |n since there are | I for each Y. 



Fig. 5. Representation of the two metamotifs Yi and Y 3 , built from Y and I 
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Note that the isomorphism test is done on the generated map of motifs seen 
as made of the motifs of the first base, otherwise we could not detect some 
isomorphic copies. 

The choice of a new base can also be interesting if it decreases its size or 
the size of the alphabet. It is a way to encode constraints on some specific base 
understood by the user. For instance the base X (a, a, a, a), V (a, a, b) and I (6, b) 
can be turned into the base XI (a, a, a, a), X2 (a, d, d, d) because with I we can 
only connect two V. It is now easy to see that we are generating the 4-regular 
planar bipartite maps. In that particular case, the efficiency of our algorithm is 
not improved since the generated paths are the same. 


5 Results 

The code and the exhaustive results of our approach can be found at the following 
address http://kekule.prism.uvsq.fr. For several sets of motifs, one can find 
the set of generated maps and their indices. We stopped all computations at 300 
seconds an put a - in the tables when the algorithm has not finished. All times 
are given in second, a.f. stands for almost foldable. 


Table 1. Number of backbones and generation time for J (a, 6), VI (a,a,6), V2 {a,b,b) 


Size 

Tree 

A.f. tree 

Path 

A.f. path 


Backbones 

Time 

Backbones 

Time 

Backbones 

Time 

Backbones 

Time 

9 

5.70 10^’ 

0.09 

3.85 10® 

0.05 

4.92 10'‘ 

0.01 

9.87 10® 

0.01 

12 

1.16 10® 

14.28 

5.55 lO’’ 

7.98 

1.77 10® 

0.28 

2.46 10® 

0.08 

15 

- 

- 

_ 

- 

7.26 lO’’ 

10.88 

6.17 10® 

1.74 

18 

- 

- 

- 

- 

- 

- 

1.56 10® 

45.84 


In Tab. we give the time to compute the backbones and the number of 
backbones generated (we also count isomorphic copies which are generated). 
The time to compute cycles is not given since they are computed from paths, 
the difference is seen in the number of folded maps and the time to generate 
them. In Tab.[^ we give the time to generate all unique maps and their indices. 


Table 2. Number of maps and time to generate them and their indices for J {a,b), 
VI (d,d,b), V2 (a,b,b) 


Size 

A.f. tree 

A.f. path 

A.f. cycle 


A.f. backb. 

Maps 

Time 

A.f. backb. 

Maps 

Time 

A.f. backb. 

Maps 

Time 

9 

3.85 10® 

236 

0.32 

9.87 10® 

236 

0.03 

8.06 10® 

148 

0.01 

12 

5.55 10'^ 

4476 

53.99 

2.46 10® 

4463 

0.71 

2.03 10® 

1931 

0.32 

15 

- 

> 98100 

- 

6.17 10® 

97112 

28.40 

5.13 10® 

29164 

8.81 

18 

- 

- 

- 

1.56 10® 

2307686 

- 

1.30 10® 

501503 

184.48 


15 
































Remark that the number of unique maps generated by trees, paths or cycles are 
different, since only the generation from trees is exhaustive. However, most of 
the maps with the largest minimum sparsity are generated with paths or cycles 
as backbones. 

6 Chemical validation 

Using the set of motifs {X, I}, if we take for each size of maps the ones with the 
lowest cut indices, we find the molecules obtained by Warmuth and Liu (Solvent 
effects in thermodynamically controlled multicomponent nanocage syntheses) in 
real-life experiments. An example of a molecular map built on {X, 1} is given in 
Fig.§ (in 3 dimension for easier reading). The white elements are X and the red 
I. Its chemical realization by Warmuth and Liu is also given in the same figure. 



Fig. 6. A cage obtained by Warmuth with 6 X and 12 I 


From all maps of size 8 based on Y (a, a, a), VI (a, b, b) and V2 (d, b, b), we 
have selected the map of Fig. [^because it has good indices. This has led to the 
conception of a real molecule which can be represented by this molecular map. 
It is given in Fig. the blue parts being the Y, the black parts the VI and the 
green parts the V2. 
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